A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data

Authors

  • Khaled Alsabti
  • Sanjay Ranka
  • Vineet Singh
Abstract

The φ-quantile of an ordered sequence of data values is the element with rank φn, where n is the total number of values. Accurate estimates of quantiles are required for the solution of many practical problems. In this paper, we present a new algorithm for estimating the quantile values for disk-resident data. Our algorithm has the following characteristics: (1) it requires only one pass over the data; (2) it is deterministic; (3) it produces good lower and upper bounds of the true values of the quantiles; (4) it requires no a priori knowledge of the distribution of the data set; (5) it has a scalable parallel formulation; (6) the extra time and memory for computing additional quantiles (beyond the first one) are constant per quantile. We present experimental results on the IBM SP-2. The results show that the algorithm is robust and does not depend on the distribution of the data sets.
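The rank-based definition in the abstract can be made concrete with a small exact computation. The sketch below is not the paper's one-pass algorithm: it sorts all values in memory and serves only as a reference point against which an estimate's lower and upper bounds could be checked. The function name `exact_quantile` is hypothetical, and rounding the rank φn up to the next integer is an assumption, since the abstract does not say how a fractional rank is handled.

```python
import math


def exact_quantile(values, phi):
    """Return the element with 1-indexed rank ceil(phi * n) in sorted order.

    Assumption: a fractional rank phi * n is rounded up. Floating-point
    rounding in phi * n can shift the rank by one for some inputs.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < phi <= 1:
        raise ValueError("phi must be in (0, 1]")
    ordered = sorted(values)              # in-memory sort, not one-pass
    rank = math.ceil(phi * len(ordered))  # rank is in 1..n
    return ordered[rank - 1]


data = [7, 1, 5, 3, 9, 2, 8, 4]
median = exact_quantile(data, 0.5)        # element of rank 4 among 8 values
```

Sorting costs O(n log n) time and needs the whole data set in memory, which is exactly what a one-pass method for disk-resident data is meant to avoid.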


Similar resources

[3] Rakesh Agrawal and Arun Swami. A one-pass space-efficient algorithm for finding quantiles.
A one-pass algorithm for accurately estimating quantiles for disk-resident data.
[7] A. Asuncion and D. J. Newman. UCI Machine Learning Repository.
[8] Jürgen Beringer and Eyke Hüllermeier. An efficient algorithm for instance-based learning on data streams.


Novel Algorithms for Computing Medians and Other Quantiles of Disk-Resident Data

In data warehousing applications, numerous OLAP queries involve the processing of holistic operations such as computing the "top N", median, etc. Efficient implementations of these operations are hard to come by. Several algorithms have been proposed in the literature that estimate various quantiles of disk-resident data. Two such recent algorithms are based on sampling. In this paper we presen...


How to Summarize the Universe: Dynamic Maintenance of Quantiles

Order statistics, i.e., quantiles, are frequently used in databases both at the database server as well as the application level. For example, they are useful in selectivity estimation during query optimization, in partitioning large relations, in estimating query result sizes when building user interfaces, and in characterizing the data distribution of evolving datasets in the process of data...


Estimating Quantiles from the Union of Historical and Streaming Data

Modern enterprises generate huge amounts of streaming data, for example, micro-blog feeds, financial data, network monitoring and industrial application monitoring. While Data Stream Management Systems have proven successful in providing support for real-time alerting, many applications, such as network monitoring for intrusion detection and real-time bidding, require complex analytics over his...


Estimating Aggregate Properties on Probabilistic Streams

The probabilistic-stream model was introduced by Jayram et al. [16]. It is a generalization of the data stream model that is suited to handling "probabilistic" data where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over potentially a very large number of classical "deterministic" streams...



Publication date: 1997